EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes

Accurate alignment of transcribed RNA to reference genomes is a critical step in the analysis of gene expression, which in turn has broad applications in biomedical research and in the basic sciences. We reveal that widely used splice-aware aligners, such as STAR and HISAT2, can introduce erroneous spliced alignments between repeated sequences, leading to the inclusion of falsely spliced transcripts in RNA-seq experiments. In some cases, the ‘phantom’ introns resulting from these errors make their way into widely-used genome annotation databases. To address this issue, we present EASTR (Emending Alignments of Spliced Transcript Reads), a software tool that detects and removes falsely spliced alignments or transcripts from alignment and annotation files. EASTR improves the accuracy of spliced alignments across diverse species, including human, maize, and Arabidopsis thaliana, by detecting sequence similarity between intron-flanking regions. We demonstrate that applying EASTR before transcript assembly substantially reduces false positive introns, exons, and transcripts, improving the overall accuracy of assembled transcripts. Additionally, we show that EASTR’s application to reference annotation databases can detect and correct likely cases of mis-annotated transcripts.

To assess EASTR's capability to discriminate valid junctions within repetitive regions, we focused on human genome segments containing HERVs. We used SpliceAI 1 (version 1.3.1) to score all splice junctions overlapping HERV elements, extracted from the alignments of the same dataset of 23 DLPFC human RNA-seq samples referenced in the main manuscript. Of the 4,287 junctions detected by HISAT and the 2,344 identified by STAR that overlapped HERV elements at either the donor or the acceptor site, EASTR removed 261 and 205 junctions, respectively. In total, 375 HERV-to-HERV junctions identified by either HISAT or STAR were removed by EASTR, and 804 HERV-to-HERV junctions identified by either aligner were kept. If EASTR is effective in discriminating spurious junctions from genuine ones, the SpliceAI scores should be higher among the kept junctions. Indeed, the junctions removed by EASTR had lower average SpliceAI scores (0.05 for acceptors and 0.009 for donors), while the retained junctions had significantly higher average scores (0.3 for acceptors and 0.17 for donors). In addition, none of the removed junctions had a splice donor or splice acceptor score above 0.6, whereas the set of retained junctions contained 37 such instances (details provided in Supplementary Data 16 and 17).
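The comparison above reduces to two summary statistics per junction set: the mean acceptor and donor scores, and the number of junctions with either score above 0.6. The following is a minimal sketch of that summary, not the analysis code used for this note. The function name `summarize_scores` and the toy score pairs are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def summarize_scores(junctions, threshold=0.6):
    """Summarize SpliceAI scores for a set of junctions.

    `junctions` is a list of (acceptor_score, donor_score) pairs.
    Returns the mean acceptor score, the mean donor score, and the number
    of junctions whose acceptor OR donor score exceeds `threshold`.
    """
    acceptor_scores = [acceptor for acceptor, _ in junctions]
    donor_scores = [donor for _, donor in junctions]
    n_above = sum(
        1 for acceptor, donor in junctions
        if acceptor > threshold or donor > threshold
    )
    return {
        "mean_acceptor": mean(acceptor_scores),
        "mean_donor": mean(donor_scores),
        "n_above_threshold": n_above,
    }


# Toy values for illustration only; these are NOT the scores from the study.
removed = [(0.0, 0.0), (0.25, 0.0)]
kept = [(0.75, 0.25), (0.25, 0.75), (0.5, 0.5)]

removed_summary = summarize_scores(removed)
kept_summary = summarize_scores(kept)
```

Under the expectation stated above, a set of removed junctions would show lower means and fewer (ideally zero) junctions above the threshold than the kept set.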

Supplementary Note 2: Questionable intron in TCEANC transcript
While the majority of MANE-selected isoforms serve as the preferred representatives for their corresponding genes, Sommer et al. identified several noteworthy exceptions 2. Our analysis suggests that TCEANC may be another such exception. Exonization events are infrequent, and consecutive Alu element exonization requires a minimum of four specific mutations 3. Moreover, Alu elements are susceptible to non-allelic homologous recombination (NAHR) 4, which can produce deletions masquerading as introns during spliced alignments (Supplementary Figure 1C), so intron splicing events in these regions require careful evaluation. The GTEx 5 RNA-seq data contain uniquely aligned spliced reads at this junction, possibly indicating a deletion. Although prevalent structural variant databases, such as dbVar 6 and gnomAD 7, report a chimeric Alu-producing duplication in this region (Supplementary Figure 1B), further investigation is required to validate this junction, as additional recombinations, including deletions, may remain undetected due to factors such as the limitations of structural variant callers with short reads and the abundance of Alu elements.
Furthermore, we used SpliceAI 8 to evaluate the acceptor and donor splice sites. The evaluation included the entire intron, an additional 200 bp of sequence upstream of the donor, and 200 bp of sequence downstream of the acceptor. Following the guidelines from the SpliceAI manual (https://github.com/Illumina/SpliceAI), we included an extra 5,000 bp on both the donor and acceptor sides, resulting in 10,000 bp of flanking sequence context. Consequently, the full length of a splice site input into the SpliceAI model is 10,400 bp plus the intron length. We calculated the average score from the five trained models for each site. The results, presented in Supplementary Figure 2, demonstrate that the second putative exonization event is notably weak.
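The input-length arithmetic and the five-model averaging described above can be sketched as follows. This is an illustrative sketch only. The function names are hypothetical, the coordinates are assumed to be 0-based and half-open on the forward strand, and the intron is assumed to lie far enough from the chromosome ends that no clipping is needed. The sketch does not call SpliceAI itself.

```python
FLANK_BP = 200      # evaluated sequence beyond the donor and beyond the acceptor
CONTEXT_BP = 5000   # extra sequence context added on each side


def spliceai_input_length(intron_length):
    """Total input length: intron + 2 x 200 bp flanks + 2 x 5,000 bp context."""
    return intron_length + 2 * FLANK_BP + 2 * CONTEXT_BP


def input_window(intron_start, intron_end):
    """Genomic window (0-based, half-open) covering the intron plus padding."""
    pad = FLANK_BP + CONTEXT_BP
    return intron_start - pad, intron_end + pad


def average_model_scores(scores):
    """Average the per-site scores from the five trained models."""
    if len(scores) != 5:
        raise ValueError("expected one score from each of the five models")
    return sum(scores) / len(scores)


# Hypothetical 1,000 bp intron, used only to check the arithmetic.
length = spliceai_input_length(1000)           # 1,000 + 10,400
window = input_window(10_000, 11_000)
avg = average_model_scores([0.5, 0.5, 0.5, 0.5, 0.5])
```

For any intron, the window returned by `input_window` spans exactly `spliceai_input_length(intron_end - intron_start)` bases, which matches the "10,400 bp plus the intron length" stated above.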

Supplementary Table 2: Transcript reconstruction precision and sensitivity metrics for different alignment approaches.
Sensitivity and precision of transcript assembly in 23 human RNA-seq samples. Note that the low sensitivity numbers are mostly due to the fact that not all genes from the annotation are expressed.

Supplementary Note 6: Analysis of shared spurious junctions across multiple samples
We analyzed 489 RNA-seq samples of heart tissues from GTEx to investigate whether spurious junctions detected by EASTR were shared across multiple samples or unique to individual ones. To this end, we extracted junctions from HISAT2 alignments of all samples and consolidated them into a single BED file. Only junctions supported by more than 5 alignments across the whole dataset were included. We then used EASTR to identify spurious junctions within this set.
As shown in Supplementary Figure 5, spurious junctions are often shared across samples, indicating a systematic pattern in these spurious alignments. This observation highlights the limitations of the simple redundancy filtering techniques commonly used in studies, which may not be adequate to remove shared alignment artifacts, and emphasizes the need for more refined methods to identify and mitigate such artifacts.
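The consolidation step described in this note (pooling junctions from all samples, keeping those supported by more than 5 alignments in total, and recording how many samples share each junction) can be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline used here. The function names, the in-memory input structure, and the six-column BED layout (sample count in the name field, total alignment count in the score field) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def consolidate_junctions(per_sample, min_support=5):
    """Pool junction counts across samples.

    `per_sample` maps a sample name to a dict of
    {(chrom, start, end, strand): alignment_count}.
    Returns {junction: (total_alignments, n_samples)} for junctions whose
    total support across the dataset is strictly greater than `min_support`.
    """
    totals = defaultdict(int)
    n_samples = defaultdict(int)
    for junctions in per_sample.values():
        for junction, count in junctions.items():
            totals[junction] += count
            n_samples[junction] += 1
    return {
        junction: (totals[junction], n_samples[junction])
        for junction in totals
        if totals[junction] > min_support
    }


def to_bed_lines(consolidated):
    """Format consolidated junctions as tab-separated BED-like lines."""
    lines = []
    for (chrom, start, end, strand), (total, n) in sorted(consolidated.items()):
        lines.append(f"{chrom}\t{start}\t{end}\tsamples={n}\t{total}\t{strand}")
    return lines


# Toy input for illustration only.
per_sample = {
    "s1": {("chr1", 100, 200, "+"): 4, ("chr1", 300, 400, "-"): 1},
    "s2": {("chr1", 100, 200, "+"): 3, ("chr2", 50, 90, "+"): 5},
}
consolidated = consolidate_junctions(per_sample)
bed_lines = to_bed_lines(consolidated)
```

Keeping the per-junction sample count alongside the total support makes it straightforward to check afterwards whether a junction flagged as spurious recurs across many samples or appears in only one.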